Conversation

@ErwinTerpstra (Contributor) commented Jan 19, 2026

Proposed changes

Support for packed 4-bit floating point (fp4) for both the A and B tensors in block scale gemm. Tested with A using a 1D block scale and B using a 2D block scale. Works for both the "regular" and Preshuffle-B pipelines. Note that the regular pipeline stores data as fp8 in LDS (as this is how int4 was implemented). The WP pipeline stores tensor A as fp4 in LDS and dequantizes it when loading to registers.

Changes include:

  • Add fp4 support to the ABQuant example, with and without PreshuffleB
  • Tests for fp4 on both the A and B tensors (a4w4), covering the base case, irregular sizes, and the preshuffle-B pipeline
  • Other changes:
    • Add support to InterleavedPKTypeLoader for generic type conversions instead of only int4
    • Add a LUT for converting fp4 to fp8, which improves performance for a 4K tensor by around 25% on gfx12. Disabled by default; gated behind TEST_convert_with_table. A sketch of the idea follows this list.
    • Some helper traits for working with packed or mixed-precision types, including a method to determine the MFMA type based on the input types
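
A minimal sketch of the table-based fp4 → fp8 idea, outside of CK Tile: fp4 (e2m1) has only 16 possible codes, so a nibble can be converted with one indexed load instead of per-element bit manipulation. The names here (fp4_e2m1_lut, unpack_fp4x2) are hypothetical, and the table maps to float for readability; the PR's table would map to fp8 bit patterns for storage in LDS.

#include <array>
#include <cstdint>
#include <utility>

// Hypothetical 16-entry table covering every fp4 (e2m1) code; the real table
// would hold fp8 bit patterns instead of floats.
constexpr std::array<float, 16> fp4_e2m1_lut = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
   -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};

// Unpack one byte holding two fp4 values (assuming the low nibble holds the
// first element) with two table lookups instead of decoding exponent/mantissa
// bits per element.
inline std::pair<float, float> unpack_fp4x2(std::uint8_t packed)
{
    return {fp4_e2m1_lut[packed & 0x0F], fp4_e2m1_lut[packed >> 4]};
}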

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@ErwinTerpstra changed the title from "Support for a4w4 (fp4) in block scale gemm AB quant" to "[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant" on Jan 21, 2026
@krithalith requested a review from ex-rzr on January 21, 2026 at 12:40
using LargestInputType = largest_type_t<ADataType, BDataType>;
if constexpr(is_packed_type_v<LargestInputType>)
{
    return t<fp8_t>{};
Contributor:

I wouldn't expect this to change anytime soon, but for maintainability reasons I'd consider adding a:

static_assert(sizeof(typename LargestInputType::type) == sizeof(fp8_t));

Contributor Author:

Done
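
For illustration, a self-contained sketch of what the selection plus the suggested assert might look like; the stand-in types (fp8_t, fp4_raw, pk_fp4_t, the t<> tag) and the free function select_mfma_type are hypothetical and only approximate the CK Tile traits referenced in the diff.

#include <cstdint>

struct fp8_t   { std::uint8_t bits; };                        // stand-in 8-bit float
struct fp4_raw { std::uint8_t bits; };                        // stand-in unpacked fp4 element
struct pk_fp4_t { using type = fp4_raw; std::uint8_t bits; }; // two fp4 values per byte
template <typename T> struct t {};                            // tag wrapper, as in the diff

template <typename T> inline constexpr bool is_packed_type_v = false;
template <> inline constexpr bool is_packed_type_v<pk_fp4_t> = true;

template <typename LargestInputType>
constexpr auto select_mfma_type()
{
    if constexpr(is_packed_type_v<LargestInputType>)
    {
        // Guard suggested in the review: the packed element's storage must
        // match fp8 so returning an fp8 MFMA type stays valid.
        static_assert(sizeof(typename LargestInputType::type) == sizeof(fp8_t));
        return t<fp8_t>{};
    }
    else
    {
        return t<LargestInputType>{};
    }
}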

{
    const BDataType pk_val = b_element_op(b_k_n(index));
    const fp32x2_t fp32_val = pk_val.to_fp32x2();
    self(index) = (index[0] & 1) ? fp32_val.hi : fp32_val.lo;
Contributor:

For a_acc you do (index[1] & 1) and for b_acc you do (index[0] & 1). The reason is not immediately apparent, and the removed hunk always did (k & 1). As you've explained to me, this is because A is MxK and B is KxN.

You may want to add a comment explaining it or, even better, make the code self-explanatory by doing something like

constexpr auto A_TENSOR_K_DIM = 1;
constexpr auto B_TENSOR_K_DIM = 0;
(index[A_TENSOR_K_DIM] & 1)
(index[B_TENSOR_K_DIM] & 1)

Contributor Author:

Done
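
A minimal standalone sketch of the named-K-dimension approach for the reference dequant; the helper select_half and the std::array index are hypothetical stand-ins for the tensor index type used in the test code.

#include <array>
#include <cstddef>

// A is laid out MxK and B is KxN, so the packed (two-elements-per-byte)
// dimension is index 1 for A and index 0 for B.
constexpr std::size_t A_TENSOR_K_DIM = 1; // A: [M, K]
constexpr std::size_t B_TENSOR_K_DIM = 0; // B: [K, N]

struct fp32x2 { float lo; float hi; };

// Even K index -> low half of the dequantized pair, odd K index -> high half.
inline float select_half(const fp32x2& pair,
                         const std::array<std::size_t, 2>& index,
                         std::size_t k_dim)
{
    return (index[k_dim] & 1) ? pair.hi : pair.lo;
}

For B the call site would then read, for example, select_half(pk_val.to_fp32x2(), index, B_TENSOR_K_DIM).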

CK_TILE_DEVICE void load_int4_tile(WarpTile& dst, const WarpWindow& src)
{
if constexpr(std::is_same_v<SrcDataType, pk_int4_t>)
if constexpr(numeric_traits<SrcDataType>::PackedSize > 1)
Contributor:

Use is_packed_type_v here

Contributor Author:

Done
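
A sketch of how such a trait can wrap the PackedSize query so call sites read as a single named condition. The definitions of numeric_traits, pk_int4_t and pk_fp4_t below are simplified stand-ins; the actual CK Tile definitions may differ.

// Scalar types carry one logical element per object.
template <typename T>
struct numeric_traits
{
    static constexpr int PackedSize = 1;
};

struct pk_int4_t { unsigned char bits; }; // stand-in: two int4 values per byte
struct pk_fp4_t  { unsigned char bits; }; // stand-in: two fp4 values per byte

template <> struct numeric_traits<pk_int4_t> { static constexpr int PackedSize = 2; };
template <> struct numeric_traits<pk_fp4_t>  { static constexpr int PackedSize = 2; };

// A type is "packed" when it stores more than one logical element.
template <typename T>
inline constexpr bool is_packed_type_v = (numeric_traits<T>::PackedSize > 1);

static_assert(is_packed_type_v<pk_fp4_t>);
static_assert(!is_packed_type_v<float>);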

using WarpGemm = WarpGemmDispatcher<typename Problem::ADataType,
                                    BTypeToUse,
using WarpGemm = WarpGemmDispatcher<typename Problem::ComputeDataType,
                                    typename Problem::ComputeDataType,
Contributor:

NIT: I'm wondering whether it would make more sense to rename this to PrecomputedComputeDataType, because "compute" is a verb and thus makes me think of a function.

Contributor:

Or ComputationDataType

Contributor Author:

I understand where you are coming from, but ComputeDataType is the existing convention for the MFMA input type in CK/CK Tile.

abquant_quantgrouped_fp4_instance_factory(lut);
abquant_quantgrouped_fp8_instance_factory(lut);
abquant_quantgrouped_bf8_instance_factory(lut);
abquant_quantgrouped_preshuffleb_fp4_instance_factory(lut);
Contributor:

Can't this and the non-preshuffleb variant be in the same file/function like we do on fp8 and bf8?

Contributor Author:

I split them specifically because the preshuffleb pipeline is really slow to compile. This way it can start compiling simultaneously with the other instances, and we don't extend compile times with a single translation unit taking longer than necessary. For other instances (e.g. the bquant instances) the preshuffleb variants are also split out.

So, for consistency, we could actually also split the fp8/bf8 instances into a preshuffle-specific file.

std::cout << "The CPU verification result is:" << (pass ? "correct" : "fail") << std::endl;

// Calculate and display reference timing
using DurationType = std::chrono::duration<double>;
Contributor:

You could directly use std::chrono::milliseconds

Contributor Author:

That would only give millisecond precision, right? (Not that this does crucial timing, but I do look at 0.1 ms precision.)
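
A minimal standalone illustration of the precision point: std::chrono::milliseconds is an integer tick count and truncates, while a double-based duration (as in the existing duration<double>) keeps the fractional milliseconds.

#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    const auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::microseconds(2500)); // stand-in for the timed work
    const auto stop = std::chrono::steady_clock::now();

    const auto whole_ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    const std::chrono::duration<double, std::milli> frac_ms = stop - start;

    // Prints something like "2 ms vs 2.531 ms".
    std::printf("%lld ms vs %.3f ms\n",
                static_cast<long long>(whole_ms.count()), frac_ms.count());
}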
